Bandits with heavy tail

Sebastien Bubeck, Nicolo Cesa-Bianchi, Gabor Lugosi

September 11, 2012

Abstract

The stochastic multi-armed bandit problem is well understood when the reward
distributions are sub-Gaussian. In this paper we examine the bandit problem under
the weaker assumption that the distributions have moments of order $1+\epsilon$, for some
$\epsilon \in (0,1]$. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain
regret bounds of the same order as under sub-Gaussian reward distributions. In order
to achieve such regret, we define sampling strategies based on refined estimators of the
mean such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means
estimator. We also derive matching lower bounds that show that the best
achievable regret deteriorates when $\epsilon < 1$.

1 Introduction

In this paper we investigate the classical stochastic multi-armed bandit problem introduced
by Robbins [1952] and described as follows: an agent facing $K$ actions (or bandit arms)
selects one arm at every time step. With each arm $i \in \{1,\dots,K\}$ there is an associated
probability distribution $\nu_i$ with finite mean $\mu_i$. These distributions are unknown to the agent.
At each round $t = 1,\dots,n$, the agent chooses an arm $I_t$, and observes a reward drawn from
$\nu_{I_t}$ independently from the past given $I_t$. The goal of the agent is to minimize the regret

$$R_n = n \max_{i=1,\dots,K} \mu_i - \mathbb{E} \sum_{t=1}^{n} \mu_{I_t} .$$

We refer the reader to Bubeck and Cesa-Bianchi [2012] for a survey of the extensive 
literature of this problem and its variations. The vast majority of authors assume that the 
unknown distributions are sub-Gaussian, that is, the moment generating function of each 
$\nu_i$ is such that if $X$ is a random variable drawn according to the distribution $\nu_i$, then for all
$\lambda \ge 0$,

$$\ln \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \le \frac{v\lambda^2}{2} \quad\text{and}\quad \ln \mathbb{E}\, e^{\lambda (\mathbb{E}X - X)} \le \frac{v\lambda^2}{2} , \tag{1}$$

where $v > 0$, the so-called "variance factor", is a parameter that is usually assumed to
be known. In particular, if rewards take values in $[0,1]$, then by Hoeffding's lemma, one
may take $v = 1/4$. Similarly to the asymptotic bound of [Agrawal, 1995, Theorem 4.10],
this moment assumption was generalized in [Bubeck and Cesa-Bianchi, 2012, Chapter 2] by
assuming that there exists a convex function $\psi : \mathbb{R}_+ \to \mathbb{R}$ such that, for all $\lambda \ge 0$,

$$\ln \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \le \psi(\lambda) \quad\text{and}\quad \ln \mathbb{E}\, e^{\lambda (\mathbb{E}X - X)} \le \psi(\lambda) . \tag{2}$$

Then one can show that the so-called $\psi$-UCB strategy (a variant of the basic UCB strategy
of Auer et al. [2002]) satisfies the following regret guarantee. Let $\Delta_i = \max_{j=1,\dots,K} \mu_j - \mu_i$,
and let $\psi^*$ be the Legendre-Fenchel transform of $\psi$, defined by

$$\psi^*(\epsilon) = \sup_{\lambda} \big( \lambda\epsilon - \psi(\lambda) \big) .$$

Then $\psi$-UCB$^1$ satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \left( \frac{4\Delta_i}{\psi^*(\Delta_i/2)} \ln n + 2 \right) .$$

In particular, when the reward distributions are sub-Gaussian, the regret bound is of the
order $\sum_i (\log n)/\Delta_i$, which is known to be optimal even for bounded reward distributions,
see Auer et al. [2002].

While this result shows that assumptions weaker than sub-Gaussian distributions may
suffice for a logarithmic regret, it still requires the distributions to have a finite moment
generating function. Another disadvantage of the bound above is that the dependence on the
gaps $\Delta_i$ deteriorates as the tails of the distributions become heavier. In fact, as we show
in this paper, the bound is sub-optimal when the tails are heavier than sub-Gaussian.

In this paper we investigate the behavior of the regret when the distributions are
heavy-tailed, and might not have a finite moment generating function. We show that under
significantly weaker assumptions, regret bounds of the same form as in the sub-Gaussian case may
be achieved. In fact, the only condition we need is that the reward distributions have a finite
variance. Moreover, even if the variance is infinite but the distributions have finite moments
of order $1+\epsilon$ for some $\epsilon > 0$, one may still achieve a regret logarithmic in the number $n$
of rounds, though the dependency on the $\Delta_i$'s worsens as $\epsilon$ gets smaller. For instance, for
distributions with moment of order $1+\epsilon$ bounded by 1 we derive a strategy that satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \left( 8 \left( \frac{4}{\Delta_i} \right)^{1/\epsilon} \log n + 5\Delta_i \right) .$$

The key to this result is to replace the empirical mean by more refined robust estimators of
the mean and construct "upper confidence bound" strategies.

We also prove matching lower bounds that show that the proposed strategies are optimal
up to constant factors. In particular, the dependency in $1/\Delta_i^{1/\epsilon}$ is unavoidable.

In the following we start by defining a general class of sampling strategies that are based
on the availability of estimators of the mean with certain performance guarantees. Then we
examine various estimators for the mean. For each estimator we describe its performance
(in terms of concentration to the mean) and deduce the corresponding regret bound.

$^1$ More precisely, $(\alpha,\psi)$-UCB with $\alpha = 4$.

2 Robust upper confidence bound strategies

The rough idea behind upper confidence bound (UCB) strategies (see Lai and Robbins [1985],
Agrawal [1995], Auer et al. [2002]) is that one should choose an arm for which the sum of
its estimated mean and a confidence interval is highest. When the reward distributions all
satisfy the sub-Gaussian condition (1) for a common variance factor $v$, then such a confidence
interval is easy to obtain. Suppose that at a certain time instance arm $i$ has been sampled
$s$ times and the observed rewards are $X_{i,1},\dots,X_{i,s}$. Then the $X_{i,r}$, $r = 1,\dots,s$, are i.i.d.
random variables with mean $\mathbb{E}X_{i,r} = \mu_i$ and by a simple Chernoff bound, for any $\delta \in (0,1)$,
the empirical mean $(1/s)\sum_{r=1}^{s} X_{i,r}$ satisfies, with probability at least $1-\delta$,

$$\frac{1}{s}\sum_{r=1}^{s} X_{i,r} \le \mu_i + \sqrt{\frac{2v\log(1/\delta)}{s}} .$$

This property of the empirical mean turns out to be crucial in order to achieve a regret
of optimal order. However, when the sub-Gaussian assumption does not hold, one cannot
expect the empirical mean to have such an accuracy. In fact, if one only knows, say, that
the variance of each $X_{i,r}$ is bounded, then the best possible confidence intervals are
significantly wider, deteriorating the performance of standard UCB strategies. (See Appendix A
for properties of the empirical mean under distributions of heavy tails.)

The key to successfully handling heavy-tailed reward distributions is to replace the
empirical mean with other, more robust, estimators of the mean. All we need is a performance
guarantee like the one shown above for the empirical mean. More precisely, we need a mean
estimator with the following property.

Assumption 1 Let $\epsilon \in (0,1]$ be a positive parameter and let $c, v$ be positive constants. Let
$X_1,\dots,X_n$ be i.i.d. random variables with finite mean $\mu$. Suppose that for all $\delta \in (0,1)$ there
exists an estimator $\hat\mu = \hat\mu(n,\delta)$ such that, with probability at least $1-\delta$,

$$\hat\mu \le \mu + v^{1/(1+\epsilon)} \left( \frac{c\log(1/\delta)}{n} \right)^{\epsilon/(1+\epsilon)} ,$$

and also, with probability at least $1-\delta$,

$$\mu \le \hat\mu + v^{1/(1+\epsilon)} \left( \frac{c\log(1/\delta)}{n} \right)^{\epsilon/(1+\epsilon)} .$$

For example, if the distribution of the $X_t$ satisfies the sub-Gaussian condition (1), then
Assumption 1 is satisfied for $\epsilon = 1$, $c = 2$, and variance factor $v$. Interestingly, the assumption
may be satisfied for significantly more general distributions by using more sophisticated mean
estimators. We recall some of these estimators in the following subsections, where we also
show how they satisfy Assumption 1. As we shall see, the basic requirement for Assumption 1
to be satisfied is that the distribution of the $X_t$ has a finite moment of order $1+\epsilon$.

We are now ready to define our generalized robust UCB strategy, described in Figure 1.
We denote by $T_i(t)$ the (random) number of times arm $i$ is selected up to time $t$.

The following proposition gives a performance bound for the robust UCB policy provided
that the reward distributions and the mean estimator used by the policy jointly satisfy
Assumption 1. Below we exhibit several mean estimators that, under various moment
assumptions, lead to regret bounds of optimal order.

Robust UCB:

Parameter: $\epsilon \in (0,1]$, mean estimator $\hat\mu(s,\delta)$.

For arm $i$, define $\hat\mu_{i,s,t}$ as the estimate $\hat\mu(s, t^{-2})$ based on the first $s$ observed values
$X_{i,1},\dots,X_{i,s}$ of the rewards of arm $i$. Define the index

$$B_{i,s,t} = \hat\mu_{i,s,t} + v^{1/(1+\epsilon)} \left( \frac{c\log t^2}{s} \right)^{\epsilon/(1+\epsilon)}$$

for $s,t \ge 1$ and $B_{i,0,t} = +\infty$.

At time $t$, draw an arm maximizing $B_{i,T_i(t-1),t}$.

Figure 1: Robust UCB policy.
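
As an illustration, the policy of Figure 1 can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the names `robust_ucb_index`, `select_arm` and `empirical_mean`, the list-of-lists reward history, and the `estimator(rewards, delta)` calling convention are assumptions made for the example. The index is the estimate plus $v^{1/(1+\epsilon)} (c \log t^2 / s)^{\epsilon/(1+\epsilon)}$, with $c$ and $v$ the constants of Assumption 1.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical calling convention: estimator(rewards, delta) -> estimate of the mean.
MeanEstimator = Callable[[Sequence[float], float], float]


def robust_ucb_index(rewards: Sequence[float], t: int, estimator: MeanEstimator,
                     eps: float, c: float, v: float) -> float:
    """Index B_{i,s,t} of one arm, given its s = len(rewards) observed rewards."""
    s = len(rewards)
    if s == 0:
        return math.inf  # B_{i,0,t} = +infinity: unexplored arms are tried first
    delta = t ** -2.0  # confidence level passed to the estimator at round t
    width = v ** (1.0 / (1.0 + eps)) * (c * math.log(t ** 2) / s) ** (eps / (1.0 + eps))
    return estimator(rewards, delta) + width


def select_arm(history: List[List[float]], t: int, estimator: MeanEstimator,
               eps: float, c: float, v: float) -> int:
    """Arm with the largest index at round t (ties go to the lowest arm number)."""
    indices = [robust_ucb_index(r, t, estimator, eps, c, v) for r in history]
    return max(range(len(indices)), key=indices.__getitem__)


def empirical_mean(rewards: Sequence[float], delta: float) -> float:
    """Plain average; delta is ignored. Used here only as a stand-in estimator."""
    return sum(rewards) / len(rewards)
```

With `eps = 1`, `c = 2` and the plain average as estimator, the width reduces to $\sqrt{2v\log(t^2)/s}$, the sub-Gaussian confidence term. Any of the robust estimators discussed in the following subsections could be passed in place of `empirical_mean`.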

Proposition 1 Let $\epsilon \in (0,1]$ and let $\hat\mu(s,\delta)$ be a mean estimator. Suppose that the
distributions $\nu_1,\dots,\nu_K$ are such that the mean estimator satisfies Assumption 1 for all $i = 1,\dots,K$.
Then the regret of the Robust UCB policy satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \left( 2c \left( \frac{v}{\Delta_i} \right)^{1/\epsilon} \log n + 5\Delta_i \right) . \tag{3}$$

Also, if $n$ is such that $\log n \ge \max_i \frac{5\,\Delta_i^{(1+\epsilon)/\epsilon}}{2c\,v^{1/\epsilon}}$, then

$$R_n \le n^{1/(1+\epsilon)} \left( 4Kc\log n \right)^{\epsilon/(1+\epsilon)} v^{1/(1+\epsilon)} . \tag{4}$$

Note that a regret of at least $\sum_{i=1}^{K} \Delta_i$ is suffered by any strategy that pulls each arm at least
once. Thus, the interesting term in (3) is the one of the order of $\sum_{i : \Delta_i > 0} (v/\Delta_i)^{1/\epsilon} \log n$. We
show below in Theorem 2 that this term is of optimal order under a moment assumption on
the reward distributions. We also show in Theorem 2 that the gap-independent inequality (4)
is optimal up to a logarithmic factor.

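Proof. Both proofs of (3) and (4) rely on bounding the expected number of pulls for a
suboptimal arm. More precisely, in the first two steps of the proof we prove that, for any $i$
such that $\Delta_i > 0$,

$$\mathbb{E}T_i(n) \le 2c \frac{v^{1/\epsilon}}{\Delta_i^{(1+\epsilon)/\epsilon}} \log n + 5 . \tag{5}$$

To lighten notation, we introduce $u = \left\lceil 2c \frac{v^{1/\epsilon}}{\Delta_i^{(1+\epsilon)/\epsilon}} \log n \right\rceil$. Note that, up to rounding, (5) is
equivalent to $\mathbb{E}T_i(n) \le u + 4$. We write $i^*$ for an optimal arm and $\mu^* = \mu_{i^*}$.

First step.

We show that if $I_t = i$, then one of the following three inequalities is true: either

$$B_{i^*,T_{i^*}(t-1),t} \le \mu^* , \tag{6}$$

or

$$\hat\mu_{i,T_i(t-1),t} > \mu_i + v^{1/(1+\epsilon)} \left( \frac{c\log t^2}{T_i(t-1)} \right)^{\epsilon/(1+\epsilon)} , \tag{7}$$

or

$$T_i(t-1) < 2c \frac{v^{1/\epsilon}}{\Delta_i^{(1+\epsilon)/\epsilon}} \log n . \tag{8}$$

Indeed, assume that all three inequalities are false. Then we have

$$\begin{aligned}
B_{i^*,T_{i^*}(t-1),t} &> \mu^* \\
&= \mu_i + \Delta_i \\
&\ge \mu_i + 2 v^{1/(1+\epsilon)} \left( \frac{c\log t^2}{T_i(t-1)} \right)^{\epsilon/(1+\epsilon)} \\
&\ge \hat\mu_{i,T_i(t-1),t} + v^{1/(1+\epsilon)} \left( \frac{c\log t^2}{T_i(t-1)} \right)^{\epsilon/(1+\epsilon)} \\
&= B_{i,T_i(t-1),t} ,
\end{aligned}$$

which implies, in particular, that $I_t \ne i$.

Second step.

Here we first bound the probability that (6) or (7) hold. By Assumption 1 as well as a
union bound over the value of $T_{i^*}(t-1)$ and $T_i(t-1)$ we obtain

$$\mathbb{P}\big( \text{(6) or (7) is true} \big) \le 2 \sum_{s=1}^{t} \frac{1}{t^4} = \frac{2}{t^3} .$$

Now using the first step, we obtain

$$\begin{aligned}
\mathbb{E}T_i(n) = \mathbb{E} \sum_{t=1}^{n} \mathbf{1}_{I_t = i}
&\le u + \mathbb{E} \sum_{t=u+1}^{n} \mathbf{1}_{I_t = i \text{ and (8) is false}} \\
&\le u + \mathbb{E} \sum_{t=u+1}^{n} \mathbf{1}_{\text{(6) or (7) is true}} \\
&\le u + \sum_{t=u+1}^{n} \frac{2}{t^3} \\
&\le u + 4 .
\end{aligned}$$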

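This concludes the proof of (5).

Third step.

Using that $R_n = \sum_{i=1}^{K} \Delta_i \mathbb{E}T_i(n)$ and (5), we directly obtain (3). On the other hand, for (4)
we use Hölder's inequality to obtain

$$\begin{aligned}
R_n &= \sum_{i : \Delta_i > 0} \Delta_i \big( \mathbb{E}T_i(n) \big)^{\frac{\epsilon}{1+\epsilon}} \big( \mathbb{E}T_i(n) \big)^{\frac{1}{1+\epsilon}} \\
&\le \sum_{i : \Delta_i > 0} \Delta_i \big( \mathbb{E}T_i(n) \big)^{\frac{1}{1+\epsilon}} \left( 2c \frac{v^{1/\epsilon}}{\Delta_i^{(1+\epsilon)/\epsilon}} \log n + 5 \right)^{\frac{\epsilon}{1+\epsilon}} \\
&\le \sum_{i : \Delta_i > 0} \big( \mathbb{E}T_i(n) \big)^{\frac{1}{1+\epsilon}} \left( 4c\, v^{1/\epsilon} \log n \right)^{\frac{\epsilon}{1+\epsilon}} && \text{(by assumption on } n\text{)} \\
&\le K^{\frac{\epsilon}{1+\epsilon}} \Big( \sum_{i : \Delta_i > 0} \mathbb{E}T_i(n) \Big)^{\frac{1}{1+\epsilon}} (4c)^{\frac{\epsilon}{1+\epsilon}} v^{\frac{1}{1+\epsilon}} (\log n)^{\frac{\epsilon}{1+\epsilon}} && \text{(by Hölder's inequality)} \\
&\le n^{\frac{1}{1+\epsilon}} \left( 4Kc\log n \right)^{\frac{\epsilon}{1+\epsilon}} v^{\frac{1}{1+\epsilon}} .
\end{aligned}$$

□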



In the next sections we show how Proposition 1 may be applied, with different mean 
estimators, to obtain optimal regret bounds for possibly heavy-tailed reward distributions. 

2.1 Truncated empirical mean 

In this section we consider the simplest of the proposed mean estimators, a truncated version 
of the empirical mean. This estimator is similar to the "winsorized mean" and "trimmed 
mean" of Tukey, see Bickel [1965]. 

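The following lemma shows that if the $(1+\epsilon)$-th raw moment is bounded, then the
truncated mean satisfies Assumption 1.

Lemma 1 Let $\delta \in (0,1)$, $\epsilon \in (0,1]$, and $u > 0$. Consider the truncated empirical mean $\hat\mu_T$
defined as

$$\hat\mu_T = \frac{1}{n} \sum_{t=1}^{n} X_t \, \mathbf{1}_{\left\{ |X_t| \le \left( \frac{ut}{\log(\delta^{-1})} \right)^{1/(1+\epsilon)} \right\}} .$$

If $\mathbb{E}|X|^{1+\epsilon} \le u$, then with probability at least $1-\delta$,

$$\hat\mu_T \le \mu + 4 u^{1/(1+\epsilon)} \left( \frac{\log(\delta^{-1})}{n} \right)^{\epsilon/(1+\epsilon)} .$$

Proof. Let $B_t = \left( \frac{ut}{\log(\delta^{-1})} \right)^{1/(1+\epsilon)}$. From Bernstein's inequality for bounded random variables,
noting that $\mathbb{E}\big( X^2 \mathbf{1}_{|X| \le B} \big) \le u B^{1-\epsilon}$, we have, with probability at least $1-\delta$,

$$\begin{aligned}
\mu - \frac{1}{n} \sum_{t=1}^{n} X_t \mathbf{1}_{|X_t| \le B_t}
&= \frac{1}{n} \sum_{t=1}^{n} \Big( \mathbb{E}X - \mathbb{E}\big( X \mathbf{1}_{|X| \le B_t} \big) \Big) + \frac{1}{n} \sum_{t=1}^{n} \Big( \mathbb{E}\big( X \mathbf{1}_{|X| \le B_t} \big) - X_t \mathbf{1}_{|X_t| \le B_t} \Big) \\
&= \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}\big( X \mathbf{1}_{|X| > B_t} \big) + \frac{1}{n} \sum_{t=1}^{n} \Big( \mathbb{E}\big( X \mathbf{1}_{|X| \le B_t} \big) - X_t \mathbf{1}_{|X_t| \le B_t} \Big) \\
&\le \frac{1}{n} \sum_{t=1}^{n} \frac{u}{B_t^{\epsilon}} + \sqrt{\frac{2 B_n^{1-\epsilon} u \log(\delta^{-1})}{n}} + \frac{B_n \log(\delta^{-1})}{3n} .
\end{aligned}$$

An easy computation concludes the proof; the inequality in the other direction follows by
applying the same argument to $-X_1,\dots,-X_n$. □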
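
The truncated empirical mean is simple to compute. The following Python function is a minimal sketch of the estimator of Lemma 1, in which the $t$-th sample is kept only if $|X_t| \le (ut/\log(\delta^{-1}))^{1/(1+\epsilon)}$ and counted as zero otherwise. The function names are assumptions made for the example, and `u` is the assumed bound on the raw moment $\mathbb{E}|X|^{1+\epsilon}$. The second function evaluates the deviation term $4u^{1/(1+\epsilon)}(\log(\delta^{-1})/n)^{\epsilon/(1+\epsilon)}$ of the lemma.

```python
import math
from typing import Sequence


def truncated_mean(xs: Sequence[float], delta: float, u: float, eps: float) -> float:
    """Truncated empirical mean: sample t (1-indexed) is kept only if
    |x_t| <= (u * t / log(1/delta)) ** (1 / (1 + eps)), and counts as 0 otherwise."""
    log_term = math.log(1.0 / delta)  # requires 0 < delta < 1
    total = 0.0
    for t, x in enumerate(xs, start=1):
        threshold = (u * t / log_term) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            total += x
    return total / len(xs)


def truncated_mean_width(n: int, delta: float, u: float, eps: float) -> float:
    """Deviation term 4 u^(1/(1+eps)) (log(1/delta) / n)^(eps/(1+eps))."""
    return 4.0 * u ** (1.0 / (1.0 + eps)) * (math.log(1.0 / delta) / n) ** (eps / (1.0 + eps))
```

Because the threshold grows with $t$, the estimate can be updated in constant time and space as new rewards arrive, provided $\delta$ is held fixed.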

The following is now a straightforward corollary of Proposition 1 and Lemma 1. 

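Theorem 1 Let $\epsilon \in (0,1]$ and $u > 0$. Assume that the reward distributions $\nu_1,\dots,\nu_K$
satisfy

$$\mathbb{E}_{X \sim \nu_i} |X|^{1+\epsilon} \le u \quad \forall i \in \{1,\dots,K\} . \tag{9}$$

Then the regret of the Robust UCB policy used with the truncated mean estimator defined
above satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \left( 8 \left( \frac{4u}{\Delta_i} \right)^{1/\epsilon} \log n + 5\Delta_i \right) .$$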
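
To see where the constants come from, one can match the deviation term of the truncated mean, $4u^{1/(1+\epsilon)}(\log(1/\delta)/n)^{\epsilon/(1+\epsilon)}$, with the form required by Assumption 1 and substitute into the leading term $2c\,(v/\Delta_i)^{1/\epsilon}\log n$ of Proposition 1. The short computation below is a sketch of that bookkeeping, taking $v = u$:

```latex
% The truncated-mean bound has the form of Assumption 1 with v = u and c^{\epsilon/(1+\epsilon)} = 4:
\[
  4\,u^{\frac{1}{1+\epsilon}} \Big( \frac{\log(1/\delta)}{n} \Big)^{\frac{\epsilon}{1+\epsilon}}
  = u^{\frac{1}{1+\epsilon}} \Big( \frac{c\,\log(1/\delta)}{n} \Big)^{\frac{\epsilon}{1+\epsilon}}
  \quad\text{with}\quad c = 4^{\frac{1+\epsilon}{\epsilon}} = 4 \cdot 4^{1/\epsilon} .
\]
% Substituting this value of c into the leading constant of Proposition 1:
\[
  2c\,\Big( \frac{u}{\Delta_i} \Big)^{1/\epsilon}
  = 2 \cdot 4 \cdot 4^{1/\epsilon} \Big( \frac{u}{\Delta_i} \Big)^{1/\epsilon}
  = 8\,\Big( \frac{4u}{\Delta_i} \Big)^{1/\epsilon} .
\]
```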

When $\epsilon = 1$, the only assumption of the theorem above is that each reward distribution has
a finite variance. In this case the obtained regret bound is of the order of $\sum_i (\log n)/\Delta_i$,
which is known to be not improvable in general, even when the rewards are bounded.
Note, however, that the KL-UCB algorithm of Garivier and Cappe [2011] is never worse than
Robust UCB in case of bounded rewards. We find it remarkable that regret of this order may
be achieved under the only assumption of finite variance and one cannot improve the order
by imposing stronger tail conditions.

When the variance is infinite but moments of order $1+\epsilon$ are available, we still have a regret
that depends only logarithmically on $n$. The bound deteriorates slightly as the dependency
on $1/\Delta_i$ is replaced by $1/\Delta_i^{1/\epsilon}$. We show next that this dependency is inevitable.

Theorem 2 For any $\Delta \in (0,1/4)$, there exist two distributions $\nu_1$ and $\nu_2$ satisfying (9)
with $u = 1$ and with $\mu_1 - \mu_2 = \Delta$, such that the following holds. Consider an algorithm
such that for any two-armed bandit problem satisfying (9) with $u = 1$ and with arm 2 being
suboptimal, one has $\mathbb{E}T_2(n) = o(n^a)$ for any $a > 0$. Then on the two-armed bandit problem
with distributions $\nu_1$ and $\nu_2$, the algorithm satisfies

$$\liminf_{n \to +\infty} \frac{R_n}{\log n} \ge \frac{0.4}{\Delta^{1/\epsilon}} . \tag{10}$$

Furthermore, for any fixed $n$, there exists a set of $K$ distributions satisfying (9) with $u = 1$
and such that for any algorithm, one has

$$R_n \ge 0.01\, K^{\epsilon/(1+\epsilon)} n^{1/(1+\epsilon)} . \tag{11}$$

Proof. To prove (10), we take $\nu_1 = (1 - \gamma^{1+\epsilon})\delta_0 + \gamma^{1+\epsilon}\delta_{1/\gamma}$ with $\gamma = (2\Delta)^{1/\epsilon}$, and
$\nu_2 = (1 + \Delta\gamma - \gamma^{1+\epsilon})\delta_0 + (\gamma^{1+\epsilon} - \Delta\gamma)\delta_{1/\gamma}$, where $\delta_x$ denotes the Dirac mass at $x$.
It is easy to see that $\nu_1$ and $\nu_2$ are well defined, and
they satisfy (9) with $u = 1$ and $\mu_1 - \mu_2 = \Delta$. Now clearly, the two-armed bandit problem with
these two distributions is equivalent to the two-armed bandit problem with two Bernoulli
distributions with parameters respectively $\gamma^{1+\epsilon}$ and $\gamma^{1+\epsilon} - \Delta\gamma$. Slightly more formally, we
could define a new algorithm $A'$ that on $\mathrm{Ber}(\gamma^{1+\epsilon})$, $\mathrm{Ber}(\gamma^{1+\epsilon} - \Delta\gamma)$ behaves equivalently to
the original algorithm $A$ on $\nu_1$ and $\nu_2$. Therefore, we can use [Bubeck, 2010, Theorem 2.7]
to directly obtain the following lower bound for $A'$,

$$\liminf_{n \to +\infty} \frac{\mathbb{E}T_2(n)}{\log n} \ge \frac{1}{\mathrm{KL}\big( \mathrm{Ber}(\gamma^{1+\epsilon} - \Delta\gamma), \mathrm{Ber}(\gamma^{1+\epsilon}) \big)} ,$$

where KL denotes the Kullback-Leibler divergence. This implies the following lower bound for
the original algorithm $A$:

$$\liminf_{n \to +\infty} \frac{R_n}{\log n} \ge \frac{\Delta}{\mathrm{KL}\big( \mathrm{Ber}(\gamma^{1+\epsilon} - \Delta\gamma), \mathrm{Ber}(\gamma^{1+\epsilon}) \big)} .$$

Equation (10) then follows directly by using $\mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q)) \le \frac{(p-q)^2}{q(1-q)}$ along with
straightforward computations.

The proof of (11) follows the same scheme. We use the same distributions as above and
we consider the multi-armed bandit problem where one arm has distribution $\nu_1$, and the
$K - 1$ remaining arms have distribution $\nu_2$. Furthermore we set $\Delta = (K/n)^{\epsilon/(1+\epsilon)}$ for this part
of the proof. Now we can use the same proof as for [Bubeck, 2010, Theorem 2.6] on the
modified algorithm $A'$ that runs on the Bernoulli distributions corresponding to $\nu_1$ and $\nu_2$.
We leave the straightforward details to the reader. □
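
The properties claimed for the two-point distributions used in this proof are easy to check numerically. The Python sketch below builds the pair for a given gap $\Delta$ and exponent $\epsilon$, with atom $1/\gamma$, $\gamma = (2\Delta)^{1/\epsilon}$, and masses $\gamma^{1+\epsilon}$ and $\gamma^{1+\epsilon} - \Delta\gamma$ on that atom. The function names and the `(atom, p1, p2)` representation are assumptions made for the example.

```python
import math
from typing import Tuple


def lower_bound_pair(gap: float, eps: float) -> Tuple[float, float, float]:
    """Return (atom, p1, p2): distribution k puts mass p_k on `atom` = 1/gamma
    and the remaining mass on 0, with gamma = (2 * gap) ** (1 / eps)."""
    gamma = (2.0 * gap) ** (1.0 / eps)
    p1 = gamma ** (1.0 + eps)
    p2 = p1 - gap * gamma
    return 1.0 / gamma, p1, p2


def mean(atom: float, p: float) -> float:
    """Mean of the two-point distribution (1 - p) * delta_0 + p * delta_atom."""
    return p * atom


def raw_moment(atom: float, p: float, eps: float) -> float:
    """Raw moment E|X|^(1+eps) of the same two-point distribution."""
    return p * atom ** (1.0 + eps)
```

For instance, with `gap = 0.1` and `eps = 0.5` the two means differ by the gap, the first distribution has raw moment equal to 1, and the second has a smaller one, so both satisfy the moment condition with bound 1.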



2.2 Median of means 

The truncated mean estimator and the corresponding bandit strategy are not entirely
satisfactory as they are not translation invariant, in the sense that the arms selected by the
strategy may change if all reward distributions are shifted by the same constant amount.
The reason for this is that the truncation is centered, quite arbitrarily, around zero. If the
raw moments $\mathbb{E}_{X \sim \nu_i}|X|^{1+\epsilon}$ are small, then the strategy has a small regret. However, it would
be more desirable to have a regret bound in terms of the centered moments $\mathbb{E}_{X \sim \nu_i}|X - \mu_i|^{1+\epsilon}$.
This is indeed possible if one replaces the truncated mean estimator by more sophisticated
estimators of the mean. We show one such possibility, the "median-of-means" estimator, in this
section. In the next section we discuss Catoni's M-estimator, a quite different alternative.

The median-of-means estimator was proposed by Alon et al. [2002]. The simple idea is to
divide the data into various disjoint blocks. Within each block one calculates the standard
empirical mean and then takes a median value of these empirical means. The next lemma shows
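that for a certain block size the estimator has the property required by our robust UCB strategy.

Lemma 2 Let $\delta \in (0,1)$ and $\epsilon \in (0,1]$. Let $X_1,\dots,X_n$ be i.i.d. random variables with mean
$\mathbb{E}X = \mu$ and centered $(1+\epsilon)$-th moment $\mathbb{E}|X - \mu|^{1+\epsilon} = v$. Let $k = \lfloor 8\log(e^{1/8}/\delta) \wedge n/2 \rfloor$
and $N = \lfloor n/k \rfloor$. Let

$$\hat\mu_1 = \frac{1}{N} \sum_{t=1}^{N} X_t , \quad \hat\mu_2 = \frac{1}{N} \sum_{t=N+1}^{2N} X_t , \quad \dots , \quad \hat\mu_k = \frac{1}{N} \sum_{t=(k-1)N+1}^{kN} X_t$$

be $k$ empirical mean estimates, each one computed on $N$ data points. Consider a median $\hat\mu_M$
of these empirical means. Then, with probability at least $1-\delta$,

$$\hat\mu_M \le \mu + (12v)^{1/(1+\epsilon)} \left( \frac{16\log(e^{1/8}\delta^{-1})}{n} \right)^{\epsilon/(1+\epsilon)} .$$

Proof. Let $\eta > 0$ and $Y_\ell = \mathbf{1}_{\hat\mu_\ell > \mu + \eta}$ for $\ell \in \{1,\dots,k\}$. According to equation (12) in the
Appendix, $Y_\ell$ has a Bernoulli distribution with parameter

$$p \le \frac{3v}{N^{\epsilon} \eta^{1+\epsilon}} .$$

Note that for

$$\eta = (12v)^{1/(1+\epsilon)} \left( \frac{1}{N} \right)^{\epsilon/(1+\epsilon)}$$

we have $p \le 1/4$. Thus using Hoeffding's inequality for the tail of a binomial distribution,
we get

$$\mathbb{P}(\hat\mu_M > \mu + \eta) \le \mathbb{P}\Big( \sum_{\ell=1}^{k} Y_\ell \ge k/2 \Big) \le \exp\big( -2k(1/2 - p)^2 \big) \le \exp(-k/8) \le \delta .$$

□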
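
The median-of-means construction translates directly into code. The following Python function is a minimal sketch under the block-count rule $k = \lfloor 8\log(e^{1/8}/\delta) \wedge n/2 \rfloor$ and block size $N = \lfloor n/k \rfloor$. The function name is an assumption made for the example, as is the fallback to a single block when the rule would give $k = 0$ (which happens for $n = 1$). When $k$ is even, `statistics.median` returns the midpoint of the two central block means, which is one valid choice of median.

```python
import math
import statistics
from typing import Sequence


def median_of_means(xs: Sequence[float], delta: float) -> float:
    """Median of k block means, each block made of N = n // k consecutive samples."""
    n = len(xs)
    k = int(min(8.0 * math.log(math.exp(0.125) / delta), n / 2.0))
    k = max(k, 1)  # assumption for this sketch: use one block for very small samples
    block_size = n // k
    block_means = [
        sum(xs[j * block_size:(j + 1) * block_size]) / block_size for j in range(k)
    ]
    return statistics.median(block_means)
```

As a usage example, for eleven samples equal to 1 followed by a single sample equal to 1000 and `delta = 0.5`, the rule gives six blocks of two points. Only one block mean is affected by the outlier, so the estimate stays at 1 while the plain average is above 80.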

The next performance bound is a straightforward consequence of Proposition 1 and
Lemma 2. In some situations it significantly improves on Theorem 1 as the bound depends
on the centered moments of order $1+\epsilon$ rather than on raw moments.

Theorem 3 Let $\epsilon \in (0,1]$ and $v > 0$. Assume that the reward distributions $\nu_1,\dots,\nu_K$
satisfy

$$\mathbb{E}_{X \sim \nu_i} |X - \mu_i|^{1+\epsilon} \le v \quad \forall i \in \{1,\dots,K\} .$$

Then the regret of the Robust UCB policy used with the median-of-means mean estimator
defined in Lemma 2 satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \left( 32 \left( \frac{12v}{\Delta_i} \right)^{1/\epsilon} \log n + 5\Delta_i \right) .$$

2.3 Catoni's M-estimator



Finally, we consider an elegant mean estimator introduced by Catoni [2010]. As we will see,
this estimator has similar performance guarantees as the median-of-means estimator but
with better, near optimal, numerical constants. However, we only have a good guarantee in
terms of the variance. Thus, in this section we assume that the variance is finite and we do
not consider the case $\epsilon < 1$.

Catoni's mean estimator is defined as follows: Let $\psi : \mathbb{R} \to \mathbb{R}$ be a continuous strictly
increasing function satisfying

$$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2) .$$

Let $\delta \in (0,1)$ be such that $n > 2\log(1/\delta)$ and introduce

$$\alpha_\delta = \sqrt{ \frac{2\log(1/\delta)}{ n \left( v + \frac{2v\log(1/\delta)}{n - 2\log(1/\delta)} \right) } } .$$

If $X_1,\dots,X_n$ are i.i.d. random variables, then Catoni's estimator is defined as the unique
value $\hat\mu_C = \hat\mu_C(n,\delta)$ such that

$$\sum_{i=1}^{n} \psi\big( \alpha_\delta (X_i - \hat\mu_C) \big) = 0 .$$

Catoni [2010] proves that if $n \ge 4\log(1/\delta)$ and the $X_i$ have mean $\mu$ and variance at most $v$,
then, with probability at least $1-\delta$,

$$\hat\mu_C \le \mu + 2\sqrt{\frac{v\log(1/\delta)}{n}} ,$$

and a similar bound holds for the lower tail. This bound has the same form as in
Assumption 1, though it only holds with the additional requirement that $n \ge 4\log(1/\delta)$ and
therefore it does not formally fit in the framework of the robust UCB strategy as described in
Section 2. However, by a simple modification, one may define a strategy that incorporates
such a restriction. In Figure 2 we describe a policy based on Catoni's mean estimator. The
policy assumes that there is a known upper bound $v$ for the largest variance of any reward
distribution. Then by a simple modification of the proof of Proposition 1, we obtain the
following performance bound.
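
Since the defining equation of Catoni's estimator is monotone in the unknown, it can be solved numerically by bisection. The Python sketch below is one possible way to do so and is not taken from Catoni [2010]. The function names, the tolerance, the iteration cap, and the particular influence function (equal to $\log(1 + x + x^2/2)$ for $x \ge 0$ and to $-\log(1 - x + x^2/2)$ for $x < 0$, which lies within the two required bounds) are assumptions made for the example.

```python
import math
from typing import Sequence


def catoni_psi(x: float) -> float:
    """An admissible influence function: the upper bound for x >= 0, the lower bound for x < 0."""
    if x >= 0.0:
        return math.log1p(x + 0.5 * x * x)
    return -math.log1p(-x + 0.5 * x * x)


def catoni_mean(xs: Sequence[float], delta: float, v: float, tol: float = 1e-10) -> float:
    """Solve sum_i psi(alpha * (x_i - m)) = 0 for m by bisection; v bounds the variance."""
    n = len(xs)
    log_term = math.log(1.0 / delta)
    if n <= 2.0 * log_term:
        raise ValueError("Catoni's estimator needs n > 2 log(1/delta)")
    alpha = math.sqrt(2.0 * log_term / (n * (v + 2.0 * v * log_term / (n - 2.0 * log_term))))
    # The sum is non-increasing in m, >= 0 at min(xs) and <= 0 at max(xs).
    lo, hi = min(xs), max(xs)
    for _ in range(200):  # iteration cap guards against floating-point stalls
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if sum(catoni_psi(alpha * (x - mid)) for x in xs) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each evaluation of the sum touches every sample, so recomputing the estimate after a new reward requires storing all past rewards of the arm.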

Theorem 4 Let $v > 0$. Assume that the reward distributions $\nu_1,\dots,\nu_K$ satisfy

$$\mathbb{E}_{X \sim \nu_i} |X - \mu_i|^{2} \le v \quad \forall i \in \{1,\dots,K\} .$$

Then the regret of the modified robust UCB policy satisfies



i:A,>0 V A ' 



The regret bound has better numerical constants than its analogue based on the
median-of-means estimator. However, a term of the order $\sum_i \Delta_i \log n$ appears due to the restricted
range of validity of Catoni's estimator.

Modified robust UCB:

For arm $i$, define $\hat\mu_{i,s,t}$ as Catoni's mean estimate $\hat\mu_C(s, t^{-2})$ based on the first $s$ observed
values $X_{i,1},\dots,X_{i,s}$ of the rewards of arm $i$. Define the index

$$B_{i,s,t} = \hat\mu_{i,s,t} + \sqrt{\frac{4v\log t^2}{s}}$$

for $s,t \ge 1$ such that $s \ge 8\log t$ and $B_{i,s,t} = +\infty$ otherwise.

At time $t$, draw an arm maximizing $B_{i,T_i(t-1),t}$.

Figure 2: Modified robust UCB policy.

3 Discussion and conclusions

In this work we have extended the UCB algorithm to heavy-tailed stochastic multi-armed
bandit problems in which the reward distributions have only moments of order $1+\epsilon$ for
some $\epsilon \in (0,1]$. In this setting, we have compared three estimators for the mean reward
of the arms: median-of-means, truncated mean, and Catoni's M-estimator. The median-of-means
estimator gives a regret bound that depends on the central $(1+\epsilon)$-moments of the
reward distributions, without need of knowing bounds on these moments. The truncated
mean estimator, instead, delivers a regret bound that depends on the raw $(1+\epsilon)$-moments,
and requires the knowledge of a bound $u$ on these moments. Finally, Catoni's estimator
depends on the central moments like the median-of-means, but it requires the knowledge of
a bound $v$ on the central moments, and only works in the special case $\epsilon = 1$ (where it gives
the best leading constants on the regret). A trade-off in the choice of the estimator appears
if we take into account the computational costs involved in the update of each estimator
as new rewards are observed. Indeed, while the truncated mean requires constant time
and space per update, the median-of-means is slightly more difficult to update, requiring
$O(\log \delta^{-1})$ space and $O(\log\log \delta^{-1})$ time per update. Finally, Catoni's M-estimator requires
linear space per update, which is an unfortunate feature in this sequential setting.

It is an interesting question whether there exists an estimator with the same good
concentration properties as the median-of-means, but requiring only constant time and space per
update. The truncated mean has good computational properties but the knowledge of raw
moment bounds is required. So it is natural to ask whether we may drop this requirement
for the truncated mean or some variants of it. Finally, our proof techniques heavily rely on
the independence of rewards for each arm. It is unclear whether similar results could be
obtained for heavy-tailed bandits with dependent reward processes.

While we focused our attention on bandit problems, the concentration results presented
in this work may be naturally applied to other related sequential decision settings. Such
examples include the racing algorithms of Maron and Moore [1997], and more generally
nonparametric Monte Carlo estimation, see Dagum et al. [2000] and Domingo et al. [2002].
These techniques are based on mean estimators, and current results are limited to the
application of the empirical mean to bounded reward distributions.



A Empirical mean 

In this appendix we discuss the behavior of the standard empirical mean when only moments
of order $1+\epsilon$ are available. We focus on finite-sample guarantees (i.e., non-asymptotic
results), as this is the key property to obtain finite-time results for the multi-armed bandit
problem.

Let $X, X_1,\dots,X_n$ be a real i.i.d. sequence with finite mean $\mu$. We assume that for some
$\epsilon \in (0,1]$ and $v > 0$, one has $\mathbb{E}|X - \mu|^{1+\epsilon} \le v$. We also denote by $u$ an upper bound on the
raw moment of order $1+\epsilon$, that is $\mathbb{E}|X|^{1+\epsilon} \le u$.

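Lemma 3 Let $\hat\mu$ be the empirical mean:

$$\hat\mu = \frac{1}{n} \sum_{t=1}^{n} X_t .$$

Then for any $\delta \in (0,1)$, with probability at least $1-\delta$, one has

$$\hat\mu \le \mu + \left( \frac{3v}{\delta n^{\epsilon}} \right)^{1/(1+\epsilon)} .$$

Proof. Let $\eta, a > 0$. Then

$$\mathbb{P}(\hat\mu - \mu > \eta) \le \mathbb{P}\big( \exists t \in \{1,\dots,n\} : |X_t - \mu| > a \big) + \mathbb{P}\Big( \frac{1}{n} \sum_{t=1}^{n} (X_t - \mu) \mathbf{1}_{|X_t - \mu| \le a} > \eta \Big) .$$

The first term on the right-hand side can be bounded by using a union bound followed by
Chebyshev's inequality (for moments of order $1+\epsilon$):

$$\mathbb{P}\big( \exists t \in \{1,\dots,n\} : |X_t - \mu| > a \big) \le n\, \mathbb{P}(|X - \mu| > a) \le \frac{nv}{a^{1+\epsilon}} .$$

On the other hand, Chebyshev's inequality together with the fact that
$\mathbb{E}(X - \mu)\mathbf{1}_{|X - \mu| \le a} = -\mathbb{E}(X - \mu)\mathbf{1}_{|X - \mu| > a}$ gives for the second term

$$\begin{aligned}
\mathbb{P}\Big( \frac{1}{n} \sum_{t=1}^{n} (X_t - \mu) \mathbf{1}_{|X_t - \mu| \le a} > \eta \Big)
&\le \frac{1}{\eta^2}\, \mathbb{E}\Big( \frac{1}{n} \sum_{t=1}^{n} (X_t - \mu) \mathbf{1}_{|X_t - \mu| \le a} \Big)^2 \\
&\le \frac{\mathbb{E}(X - \mu)^2 \mathbf{1}_{|X - \mu| \le a}}{n\eta^2} + \frac{\big( \mathbb{E}(X - \mu)\mathbf{1}_{|X - \mu| \le a} \big)^2}{\eta^2} \\
&= \frac{\mathbb{E}(X - \mu)^2 \mathbf{1}_{|X - \mu| \le a}}{n\eta^2} + \frac{\big( \mathbb{E}(X - \mu)\mathbf{1}_{|X - \mu| > a} \big)^2}{\eta^2} .
\end{aligned}$$

By applying a trivial manipulation on the first term, and using Hölder's inequality with
exponents $p = 1+\epsilon$ and $q = 1 + 1/\epsilon$ for the second term, we obtain that the last expression
above is upper bounded by

$$\frac{\mathbb{E}|X - \mu|^{1+\epsilon} a^{1-\epsilon}}{n\eta^2} + \frac{\big( \mathbb{E}|X - \mu|^{1+\epsilon} \big)^{\frac{2}{1+\epsilon}} \big( \mathbb{P}(|X - \mu| > a) \big)^{\frac{2\epsilon}{1+\epsilon}}}{\eta^2}
\le \frac{v a^{1-\epsilon}}{n\eta^2} + \frac{v^{\frac{2}{1+\epsilon}} v^{\frac{2\epsilon}{1+\epsilon}}}{\eta^2 a^{2\epsilon}} .$$

Thus we proved that

$$\mathbb{P}(\hat\mu - \mu > \eta) \le \frac{nv}{a^{1+\epsilon}} + \frac{v a^{1-\epsilon}}{n\eta^2} + \frac{v^2}{\eta^2 a^{2\epsilon}} .$$

Taking $a = n\eta$ entails

$$\mathbb{P}(\hat\mu - \mu > \eta) \le \frac{2v}{n^{\epsilon} \eta^{1+\epsilon}} + \left( \frac{v}{n^{\epsilon} \eta^{1+\epsilon}} \right)^2 .$$

Note that if $\frac{v}{n^{\epsilon} \eta^{1+\epsilon}} > 1$ then the bound is trivial, and thus we always have

$$\mathbb{P}(\hat\mu - \mu > \eta) \le \frac{3v}{n^{\epsilon} \eta^{1+\epsilon}} . \tag{12}$$

The proof now follows by straightforward computations. □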

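It is easy to see that the order of magnitude of (12) is tight up to a constant factor.
Indeed, let $\gamma \in (0,1)$ and consider the distribution $(1 - \gamma^{1+\epsilon})\delta_0 + \gamma^{1+\epsilon}\delta_{1/\gamma}$. Clearly for this
distribution we have $\mathbb{E}|X - \mu|^{1+\epsilon} \le 1$, so (12) shows that for an i.i.d. sequence drawn from
this distribution, one has

$$\mathbb{P}(\hat\mu - \mu > \eta) \le \frac{3}{n^{\epsilon} \eta^{1+\epsilon}} .$$

We can restrict our attention to the case where $\eta \ge n^{-\epsilon/(1+\epsilon)}$, for otherwise the above upper
bound is trivial. Now consider $\gamma = \frac{1}{2n\eta}$. Note that we have $\mu = \gamma^{\epsilon} = \frac{1}{(2n\eta)^{\epsilon}} \le \eta$ and
in particular this implies $1/\gamma = 2n\eta \ge n(\eta + \mu)$. From this last inequality and basic
computations we get

$$\begin{aligned}
\mathbb{P}(\hat\mu - \mu > \eta) &\ge \mathbb{P}\big( \exists i \in \{1,\dots,n\} : X_i > n(\eta + \mu) \big) \\
&\ge \mathbb{P}\big( \exists i \in \{1,\dots,n\} : X_i = 1/\gamma \big) \\
&= 1 - (1 - \gamma^{1+\epsilon})^n \\
&= 1 - \exp\left( n \ln\left( 1 - \frac{1}{(2n\eta)^{1+\epsilon}} \right) \right) \\
&\ge 1 - \exp\left( -\frac{1}{n^{\epsilon} (2\eta)^{1+\epsilon}} \right) \\
&\ge \frac{1}{n^{\epsilon} (2\eta)^{1+\epsilon}} \left( 1 - \frac{1}{2 n^{\epsilon} (2\eta)^{1+\epsilon}} \right) ,
\end{aligned}$$

which shows that (12) is tight up to a constant factor for this distribution.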
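
The practical consequence of the polynomial dependence on $1/\delta$ can be seen by evaluating the confidence widths directly. The Python sketch below compares the empirical-mean width $(3v/(\delta n^{\epsilon}))^{1/(1+\epsilon)}$ obtained from (12) with the median-of-means width $(12v)^{1/(1+\epsilon)}(16\log(e^{1/8}/\delta)/n)^{\epsilon/(1+\epsilon)}$. The function names are assumptions made for the example.

```python
import math


def empirical_mean_width(n: int, delta: float, v: float, eps: float) -> float:
    """Width implied by (12): (3v / (delta * n^eps)) ** (1 / (1 + eps))."""
    return (3.0 * v / (delta * n ** eps)) ** (1.0 / (1.0 + eps))


def median_of_means_width(n: int, delta: float, v: float, eps: float) -> float:
    """Median-of-means width: (12v)^(1/(1+eps)) * (16 log(e^(1/8)/delta) / n)^(eps/(1+eps))."""
    log_term = math.log(math.exp(0.125) / delta)
    return (12.0 * v) ** (1.0 / (1.0 + eps)) * (16.0 * log_term / n) ** (eps / (1.0 + eps))
```

For $n = 10^4$, $\epsilon = 1$, $v = 1$ and $\delta = 10^{-6}$, the empirical-mean width is $\sqrt{300} \approx 17.3$ while the median-of-means width is below 1. For a moderate confidence level such as $\delta = 0.1$ the larger numerical constants of the robust bound make it the wider of the two, so the advantage shows up at the small values of $\delta$ that UCB-type strategies require as $t$ grows.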

Clearly, the concentration properties of the empirical mean are much weaker than for the
truncated empirical mean or the median-of-means. Indeed, while the dependency on $n$ in the
confidence term is similar for the three estimators, the dependency on $1/\delta$ is polynomial for
the empirical mean and polylogarithmic for the truncated empirical mean and the
median-of-means. As we just showed, this is not an artifact of the proof method, and the empirical
mean indeed has polynomial deviations (as opposed to the exponential deviations of the two
other estimators). This remark is at the basis of the theory of robust statistics and many
approaches to fix the above issue have been proposed, see for example Huber [1964, 1981].
The empirical mean estimator has been previously applied to heavy-tailed stochastic bandits
in Liu and Zhao [2011], obtaining polynomial, rather than logarithmic, regret bounds.